Gradient descent#

In mathematics, a gradient is a vector that indicates the direction in which a multivariate function increases most rapidly. The gradient of a function is denoted as follows:

\[ \text{grad} \, u = \nabla u = \frac{{\partial u}}{{\partial x}} \overline{i} + \frac{{\partial u}}{{\partial y}} \overline{j} + \frac{{\partial u}}{{\partial z}} \overline{k} \]

In this notation, \(\nabla u\) denotes the gradient of the function \(u\). Each component is the partial derivative of \(u\) with respect to the corresponding variable \(x\), \(y\), or \(z\), multiplied by the corresponding unit vector \(\overline{i}\), \(\overline{j}\), or \(\overline{k}\) (written with a bar above the symbol).

Gradient descent is an algorithm that, at any given point, computes the direction of the fastest decrease of a function (the direction opposite to the gradient at that point) and then iteratively recomputes the function's value, moving in that direction. When the decrease in the value of the function becomes sufficiently slow, the algorithm stops and reports that it has found a (local) minimum.

Algorithm (Gradient descent)

To solve the optimization problem

\[ \mathcal L(\boldsymbol w) \to \min\limits_{\boldsymbol w} \]

do the following steps:

  1. initialize \(\boldsymbol w\) with some random values (e.g., drawn from \(\mathcal N(0, 1)\))

  2. choose tolerance \(\varepsilon > 0\) and learning rate \(\eta > 0\)

  3. while \(\Vert \nabla\mathcal L(\boldsymbol w) \Vert > \varepsilon\) do the gradient step

    \[ \boldsymbol w := \boldsymbol w - \eta\nabla\mathcal L(\boldsymbol w) \]
  4. return \(\boldsymbol w\)
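The algorithm above can be sketched in Python. This is a minimal illustration: the quadratic objective \(\mathcal L(\boldsymbol w) = \Vert\boldsymbol w\Vert^2\) with gradient \(2\boldsymbol w\), and the `max_iter` safeguard, are choices made here for the demo, not part of the original statement.

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, eps=1e-6, max_iter=10_000):
    """Minimize a function given its gradient `grad`, starting from `w0`."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):          # guard against a never-ending loop
        g = grad(w)
        if np.linalg.norm(g) <= eps:   # stopping criterion from step 3
            break
        w = w - eta * g                # the gradient step
    return w

# Minimize L(w) = ||w||^2, whose gradient is 2w and whose minimum is the origin
w_min = gradient_descent(lambda w: 2 * w, w0=[3.0, -2.0])
```

With \(\eta = 0.1\) each step multiplies \(\boldsymbol w\) by \(0.8\), so the iterates shrink geometrically toward the origin.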

https://developers.google.com/machine-learning/crash-course/images/LearningRateJustRight.svg

Fig. 1 Main idea of gradient descent algorithm#

The algorithm initializes parameters randomly, sets hyperparameters such as tolerance and learning rate, and iteratively updates the parameters using gradient descent until convergence, ultimately returning the optimized parameter vector.

https://media.tenor.com/7zVwMezTxiAAAAAC/gradientdescent-graph.gif

Fig. 2 Iteration process of gradient descent#

Note

Learning rate determines the size of the steps taken during optimization and influences the convergence speed: a smaller learning rate may lead to more precise results but slower convergence.

Note

If the condition \(\Vert \nabla\mathcal L(\boldsymbol w) \Vert > \varepsilon\) holds for too long, the loop in step 3 is terminated after a maximum number of iterations max_iter.

Example#

Let’s consider the simple quadratic function \(f(x) = x^2\). The purpose is to demonstrate a function where gradient descent can converge to the minimum in just a few steps.

  1. Calculate the gradient: \(\nabla f(x) = 2x\)

  2. Initialize parameters: choose an initial value for \(x\), e.g., \(x_{\text{old}} = 3.0\)

  3. Set the learning rate: choose a small learning rate, e.g., \(\alpha = 0.1\)

  4. Apply the update rule: \(x_{\text{new}} = x_{\text{old}} - \alpha \cdot \nabla f(x_{\text{old}})\)

  5. Iterative updates:

    Iteration 1:

    \(\nabla f(x_{\text{old}}) = 2 \cdot 3.0 = 6.0\)

    \(x_{\text{new}} = x_{\text{old}} - 0.1 \cdot 6.0 = 3.0 - 0.6 = 2.4\)

    Result: \(x_{\text{new}} = 2.4, f(x_{\text{new}}) = (2.4)^2 = 5.76\)

    Iteration 2:

    \(\nabla f(x_{\text{old}}) = 2 \cdot 2.4 = 4.8\)

    \(x_{\text{new}} = x_{\text{old}} - 0.1 \cdot 4.8 = 2.4 - 0.48 = 1.92\)

    Result: \(x_{\text{new}} = 1.92, f(x_{\text{new}}) = (1.92)^2 = 3.6864\)

  6. Final result:

    Iteration \(n\):

    \(\nabla f(x_{n-1}) = 2 \cdot x_{n-1}\)

    \(x_{n} = x_{n-1} - 0.1 \cdot \nabla f(x_{n-1})\)

    Result: \(f(x_{n}) = (x_{n})^2\)

The process should show that the value of \(x\) approaches the minimum \((x=0)\) relatively quickly due to the simplicity of the quadratic function.
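The hand computation above can be reproduced in a few lines of Python (a small sketch mirroring the worked example; the 20-step horizon is chosen here only to show the trend):

```python
def grad_f(x):
    return 2 * x                    # gradient of f(x) = x^2

x = 3.0                             # initial value x_old
alpha = 0.1                         # learning rate
trajectory = [x]
for _ in range(20):
    x = x - alpha * grad_f(x)       # the update rule
    trajectory.append(x)

# Each step multiplies x by (1 - 2 * alpha) = 0.8,
# so trajectory[1] ≈ 2.4, trajectory[2] ≈ 1.92, and x shrinks toward 0.
```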

Nevertheless, gradient descent has its drawbacks: two notable issues arise in practical applications. Like any vector, the negative gradient has both a direction and a magnitude. Depending on the function being minimized, either or both of these can pose challenges when the negative gradient is used as a descent direction.

Slow-crawling behavior of gradient descent#

Below we show an example run of gradient descent using a function

\[\begin{equation} g(w) = w^4 +0.1 \end{equation}\]

whose minimum is at the origin \(w=0\). This example highlights how the gradient magnitude influences step length in gradient descent. Steps are initially large away from a stationary point but become small, resembling crawling, near the function minimum. A deliberately set step length parameter \(\alpha = 10^{-1}\) for 10 steps accentuates this behavior. The left panel shows the original function, and the right panel displays steps from start (green) to final (red). The natural crawling near the minimum, due to vanishing gradient magnitude, hampers quick progress, as reflected in the cost function history plot.

https://iili.io/JTfn4ft.png

Fig. 3 Behavior of gradient descent near the minimum of a function#
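The crawling effect can be reproduced directly. This is a minimal sketch: the starting point \(w = 1.0\) is an assumption made here; the text only specifies \(\alpha = 10^{-1}\) and 10 steps.

```python
def grad_g(w):
    return 4 * w ** 3               # gradient of g(w) = w^4 + 0.1

w = 1.0                             # assumed starting point (not given in the text)
alpha = 0.1                         # step length parameter from the text
steps = []
for _ in range(10):
    step = -alpha * grad_g(w)
    w += step
    steps.append(abs(step))

# The step length shrinks together with the gradient magnitude,
# so progress near the minimum at w = 0 slows to a crawl.
```

Printing `steps` shows a strictly decreasing sequence of step lengths, matching the behavior in Fig. 3.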

The other case of improper performance is the zig-zagging behavior of gradient descent; more details can be found in the provided source.

Adding Momentum#

Purpose of Momentum

When employing gradient descent, several challenges arise:

  1. Becoming stuck at a local minimum, a consequence of the algorithm’s greediness.

  2. Overshooting and overlooking the global optimum due to excessively rapid movement along the gradient direction.

  3. Oscillation, a phenomenon occurring when the function’s value remains relatively constant, resembling navigation on a plateau where the height remains the same regardless of the direction.

To address these issues, a momentum term with coefficient \(\alpha\) is introduced into the expression for \(\Delta \textbf{w}\) to stabilise the updates while approaching the optimum.

In the following, the superscript i indicates the iteration number:

\[\Delta \textbf{w}^i = - \eta \nabla_\textbf{w} f(\textbf{w}^i) + \alpha \Delta \textbf{w}^{i-1}\]

Here the momentum coefficient \(\alpha \in [0, 1)\) determines how much of the previous update \(\Delta \textbf{w}^{i-1}\) is carried over into the current step; setting \(\alpha = 0\) recovers plain gradient descent.

Gradient Descent for a Univariate Function#

Let's initialize the objective function and its derivative, and set up the gradient descent function.

Note

The derivative \(f'(x)\) describes how the function changes at a specific point: it gives the direction and rate of steepest change, which we use to move toward the function's minimum.

def f(x):
    return x ** 4 - 4 * x ** 2 + 5 * x


def derivative_f(x):
    return 4 * x ** 3 - 8 * x + 5


def gradient_descent(iterations_limit, threshold, start, obj_func, derivative_f, learning_rate=0.05, momentum=0.5):
    point = start
    points = [start]
    values = [obj_func(start)]

    delta = 0          # previous update, reused by the momentum term
    i = 0
    diff = 1.0e10      # change of the objective between iterations

    while i < iterations_limit and diff > threshold:
        delta = -learning_rate * derivative_f(point) + momentum * delta  # gradient step plus momentum
        point += delta

        points.append(point)
        values.append(obj_func(point))

        diff = abs(values[-1] - values[-2])
        i += 1

    return points, values


start_point = 2
learning_rate = 0.05
momentum = 0.3

points, values = gradient_descent(100, 0.1, start_point, f, derivative_f, learning_rate, momentum)

The graphs below visualize the gradient descent optimization process for the given objective function using the Plotly library. They allow users to interactively explore how the optimization trajectory changes with different learning rates and momentum values.

Warning

A small learning rate can slow down convergence significantly. Conversely, if the learning rate is too large, the algorithm might overshoot the minimum, causing oscillations or divergence.
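The warning can be illustrated on the simple quadratic \(f(x) = x^2\). This is a sketch with illustrative learning rates chosen here, not values from the text:

```python
def run(lr, x=3.0, n=30):
    """Run n plain gradient steps on f(x) = x^2 with learning rate lr."""
    for _ in range(n):
        x = x - lr * 2 * x      # gradient of x^2 is 2x
    return x

x_small = run(lr=0.01)   # converges, but slowly: x is still far from 0
x_good = run(lr=0.1)     # converges quickly toward 0
x_large = run(lr=1.1)    # overshoots the minimum: |x| grows every step
```

With `lr=1.1` each update multiplies \(x\) by \(1 - 2.2 = -1.2\), so the iterates oscillate in sign while diverging in magnitude.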

Gradient Descent for a Bivariate Function#

The same procedure applies as in the univariate case considered earlier, but now let's consider a function of two variables, because real-world scenarios mostly involve complex relationships between multiple factors. The computations therefore involve partial derivatives, and the goal is to find a local minimum.

Note

Partial derivatives tell us how much a function changes with respect to just one variable, while keeping all other variables constant. In our case we have the gradient, which is the vector of partial derivatives of \(\mathcal{L}(\boldsymbol w)\) with respect to each component of \(\boldsymbol w\):

\[\nabla\mathcal{L}(\boldsymbol{w}) = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial w_1}, \frac{\partial \mathcal{L}}{\partial w_2}, \ldots, \frac{\partial \mathcal{L}}{\partial w_n} \end{bmatrix}\]
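Before relying on analytic partial derivatives, it is often worth checking them against central finite differences. This sketch uses the bivariate function from the code below; the check point \((4.5, 2.0)\) is the starting point used later:

```python
def f(x, y):
    return x ** 3 + x ** 2 - y ** 3 - 4 * x + 22 * y - 5

def derivative_f_x(x):
    return 3 * x ** 2 + 2 * x - 4      # analytic partial derivative in x

def derivative_f_y(y):
    return -3 * y ** 2 + 22            # analytic partial derivative in y

h = 1e-6
x0, y0 = 4.5, 2.0
fd_x = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)   # central difference in x
fd_y = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)   # central difference in y
# fd_x should closely match derivative_f_x(x0), and fd_y derivative_f_y(y0)
```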

def f(x, y):
    return x ** 3 + x ** 2 - y ** 3 - 4 * x + 22 * y - 5


def derivative_f_x(x):
    return 3 * x ** 2 + 2 * x - 4


def derivative_f_y(y):
    return -3 * y ** 2 + 22


def gradient_descent(iterations_limit, threshold, start, obj_func, derivative_f_x, derivative_f_y, learning_rate=0.05,
                     momentum=0.5):
    point = start
    points = [start]
    values = [obj_func(*start)]

    x = point[0]
    y = point[1]

    delta_x = 0
    delta_y = 0
    i = 0
    diff = 1.0e10

    while i < iterations_limit and diff > threshold:
        delta_x = -learning_rate * derivative_f_x(x) + momentum * delta_x
        delta_y = -learning_rate * derivative_f_y(y) + momentum * delta_y
        x += delta_x
        y += delta_y

        points.append([x, y])
        values.append(obj_func(*[x, y]))

        diff = abs(values[-1] - values[-2])
        i += 1

    return points, values


start_point = [4.5, 2]
learning_rate = 0.05
momentum = 0.5

points, values = gradient_descent(10, 0.01, start_point, f, derivative_f_x, derivative_f_y, learning_rate, momentum)

3D plots can visualize the trajectory of the optimization process on the surface of a bivariate function, which helps to see how the algorithm moves toward the minimum.

An alternative implementation of the gradient_descent_3d_visualization_plot function provides a more detailed visualization by annotating the gradient descent path with lines connecting the points.

Below, a 2D contour plot of the previously implemented 3D visualization is drawn. In 2D space the contour lines represent the objective function, and the annotated points show the steps taken by the algorithm during optimization.

Note

A 2D representation of 3D space makes it easier to interpret the function’s features and the trajectory of the optimization algorithm.

Gradient Descent for a Linear regression model#

https://iili.io/JTBGoej.webp

To implement gradient descent for a linear regression model, we follow the steps below:

  1. Import the testing dataset

  2. Manually set intercept and slope values for a linear regression model

  3. Implement the mean squared error (MSE) function (see the regression metrics section) and compute the error value for the model

from sklearn.model_selection import train_test_split
import pandas as pd

dataset_path = 'datasets/experience_salary.csv'
data = pd.read_csv(dataset_path)

intercept = 1
slope = 2


def linear_regression(x, intercept, slope):
    return intercept + slope * x


def mean_squared_error(x, intercept, slope, y_actual):
    n = len(x)
    total_error = 0

    for i in range(n):
        total_error += (intercept + slope * x[i] - y_actual[i]) ** 2

    return total_error / n


experience = data[['experience']]
salary = data['salary']

X_train, X_test, Y_train, Y_test = train_test_split(experience, salary, test_size=0.2, random_state=42)
X = X_test.values.flatten()
Y = Y_test.values.flatten()

linear_regression_values = [linear_regression(x, intercept, slope) for x in X]
mse = mean_squared_error(X_test.values.flatten(), intercept, slope, Y)
print(f'MSE: {mse}')
MSE: 876.7019407082549

Define a function that computes the partial derivatives with respect to the intercept and slope:

  • Iterate over each data point in the input arrays x, y_actual

  • Compute the partial derivatives of the mean squared error with respect to the intercept (derivative_intercept) and slope (derivative_slope):

    • For the intercept: \(\frac{\partial}{\partial \text{intercept}} \text{MSE} = \frac{2}{N} \sum_{i=1}^{N} (\text{intercept} + \text{slope} \cdot x[i] - \text{y_actual}[i])\)

    • For the slope: \(\frac{\partial}{\partial \text{slope}} \text{MSE} = \frac{2}{N} \sum_{i=1}^{N} (\text{intercept} + \text{slope} \cdot x[i] - \text{y_actual}[i]) \cdot x[i]\)

  • Return the average derivatives by dividing the accumulated sums by the number of data points n

def derivative_mean_squared_error(x, y_actual, intercept, slope):
    n = len(x)
    derivative_intercept = 0
    derivative_slope = 0

    for i in range(n):
        derivative_intercept += 2 * (intercept + slope * x[i] - y_actual[i])
        derivative_slope += 2 * (intercept + slope * x[i] - y_actual[i]) * x[i]

    return derivative_intercept / n, derivative_slope / n
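As with any hand-derived gradient, it is worth sanity-checking these formulas numerically. This sketch replays the same MSE and its analytic gradient on a tiny made-up dataset (the data points here are invented for illustration) and compares against central finite differences:

```python
def mse(x, y, intercept, slope):
    return sum((intercept + slope * xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

def mse_gradient(x, y, intercept, slope):
    n = len(x)
    d_intercept = sum(2 * (intercept + slope * xi - yi) for xi, yi in zip(x, y)) / n
    d_slope = sum(2 * (intercept + slope * xi - yi) * xi for xi, yi in zip(x, y)) / n
    return d_intercept, d_slope

x, y = [1.0, 2.0, 3.0], [2.0, 4.1, 5.9]   # tiny synthetic dataset
h = 1e-6
d_i, d_s = mse_gradient(x, y, 1.0, 2.0)
fd_i = (mse(x, y, 1.0 + h, 2.0) - mse(x, y, 1.0 - h, 2.0)) / (2 * h)
fd_s = (mse(x, y, 1.0, 2.0 + h) - mse(x, y, 1.0, 2.0 - h)) / (2 * h)
# fd_i should closely match d_i, and fd_s should closely match d_s
```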

The function below returns updated values of the intercept and slope parameters. The update subtracts the product of the learning rate and the respective partial derivative from the current value of each parameter.

def gradient_descent(x, y, intercept, slope, learning_rate, num_iterations):
    for i in range(num_iterations):
        partial_derivative_intercept, partial_derivative_slope = derivative_mean_squared_error(x, y, intercept, slope)
        intercept -= learning_rate * partial_derivative_intercept
        slope -= learning_rate * partial_derivative_slope

    return intercept, slope

To call the gradient_descent function, we first set arbitrary values for the hyperparameters learning_rate and the number of iterations.

  1. In each iteration, the algorithm calculates the gradient of the cost function with respect to each parameter. The gradient points in the direction of the steepest increase in the cost.

  2. The parameters are then updated in the opposite direction of the gradient to move towards the minimum of the mean square error function. The update is proportional to the learning_rate, a hyperparameter that determines the size of the steps taken in each iteration.

learning_rate = 0.001
iterations = 1000

gd_intercept, gd_slope = gradient_descent(X, Y, intercept, slope, learning_rate, iterations)
print(gd_intercept, gd_slope)
1.9080791607526073 0.9313247403333909

The code below prints the mean squared error and the predicted values for the testing data using the linear_regression model with gradient descent-optimized parameters.

gd_mse = mean_squared_error(X, gd_intercept, gd_slope, Y)
print(f'MSE with gradient descent: {gd_mse}')

gd_linear_regression_values = [linear_regression(x, gd_intercept, gd_slope) for x in X]
MSE with gradient descent: 29.349835951481545

Test the mean squared error (MSE) result against the built-in linear regression model from the sklearn library.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
mse = mean_squared_error(Y_test, Y_pred)
print(f'Sklearn mse {mse}')
Sklearn mse 27.65026873284228

Let's define a function that draws the fitted regression line over the data after the optimization has been applied.
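One possible implementation is sketched below. It assumes matplotlib is available (the notebook uses Plotly elsewhere, so this is a simplification), and the function name plot_fit_line is chosen here for illustration:

```python
import matplotlib
matplotlib.use("Agg")               # render off-screen; assumes matplotlib is installed
import matplotlib.pyplot as plt

def plot_fit_line(x, y_actual, intercept, slope):
    """Scatter the data and draw the optimized regression line over it."""
    y_fit = [intercept + slope * xi for xi in x]   # predictions of the fitted model
    fig, ax = plt.subplots()
    ax.scatter(x, y_actual, label="data")
    ax.plot(x, y_fit, color="red", label="fit line")
    ax.set_xlabel("experience")
    ax.set_ylabel("salary")
    ax.legend()
    return fig, y_fit

# Example call with the gradient descent-optimized parameters:
# plot_fit_line(X, Y, gd_intercept, gd_slope)
```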